While transformers have greatly boosted performance in semantic segmentation, domain adaptive transformers are not yet well explored. We identify that the domain gap can cause discrepancies in self-attention. Because of this gap, the transformer attends to spurious regions or pixels, which degrades accuracy on the target domain. We propose to perform adaptation on attention maps with cross-domain attention layers that share features between the source and target domains. Specifically, we impose consistency between predictions from the cross-domain attention and self-attention modules to encourage similar distributions in the attention and in the output of the model across domains, i.e., attention-level and output-level alignment. We also enforce consistency in attention maps between different augmented views to further strengthen the attention-based alignment. Combining these two components, our method mitigates the discrepancy in attention maps across domains and further boosts the performance of the transformer under unsupervised domain adaptation settings. Our model outperforms the existing state-of-the-art baseline on three widely used benchmarks: on average, by 1.3 percentage points (pp) on GTAV-to-Cityscapes, by 0.6 pp on Synthia-to-Cityscapes, and by 1.1 pp on Cityscapes-to-ACDC. Additionally, we verify the effectiveness and generalizability of our method through extensive experiments. Our code will be publicly available.
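The attention-level and output-level alignment described in this abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes that target tokens supply the queries, that cross-domain attention takes its keys and values from the source domain, and that a mean squared gap stands in for the consistency objective. The abstract does not specify any of these choices, and all function and variable names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Scaled dot-product attention weights: one softmax row per query."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

def attend(attn, values):
    """Weighted sum of value vectors under an attention map."""
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in attn
    ]

def mean_squared_gap(a, b):
    """Assumed stand-in for a consistency loss between two same-shaped maps."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Toy features: 3 tokens with 2 dimensions per domain (made-up values).
source_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target_tokens = [[0.9, 0.1], [0.2, 0.8], [0.7, 1.2]]

# Self-attention: target queries attend to target keys and values.
self_attn = attention_map(target_tokens, target_tokens)
self_out = attend(self_attn, target_tokens)

# Cross-domain attention (assumed direction): target queries attend to
# source keys and values, so features are shared across domains.
cross_attn = attention_map(target_tokens, source_tokens)
cross_out = attend(cross_attn, source_tokens)

# Attention-level and output-level gaps that a consistency term would shrink.
attention_gap = mean_squared_gap(self_attn, cross_attn)
output_gap = mean_squared_gap(self_out, cross_out)
print(attention_gap, output_gap)
```

Both gaps are zero only when the two branches agree, which is the property a consistency term of this kind would push the model toward during training.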
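While pose estimation is an important computer vision task, it requires expensive annotation and suffers from domain shift. In this paper, we investigate the problem of domain adaptive 2D pose estimation, which transfers knowledge learned on a synthetic source domain to a target domain without supervision. Although several domain adaptive pose estimation models have been proposed recently, they are not generic and focus on either human pose or animal pose estimation, so their effectiveness is somewhat limited to specific scenarios. In this work, we propose a unified framework that generalizes well to various domain adaptive pose estimation problems. We propose to align representations using both input-level and output-level cues (pixels and pose labels, respectively), which facilitates knowledge transfer from the source domain to the unlabeled target domain. Our experiments show that our method achieves state-of-the-art performance under various domain shifts. Our method outperforms existing baselines on human pose estimation by up to 4.5 percentage points (pp), on hand pose estimation by up to 7.4 pp, and on animal pose estimation by up to 4.8 pp for dogs and 3.3 pp for sheep. These results suggest that our method can mitigate domain shift on diverse tasks and even on unseen domains and objects (e.g., trained on horses and tested on dogs). Our code will be publicly available at: https://github.com/visionlearninggroup/uda_poseestimation.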